Abstract

The false discovery rate (FDR)—the expected fraction of spurious discoveries among all the 
discoveries—provides a popular statistical assessment of the reproducibility of scientific studies 
in various disciplines. In this work, we introduce a new method for controlling the FDR in 
meta-analysis of many decentralized linear models. Our method targets the scenario where
many research groups, possibly random in number, independently test a
common set of hypotheses and then send summary statistics to a coordinating center in an
online manner. Built on the knockoffs framework introduced by Barber and Candès (2015),
our procedure starts by applying the knockoff filter to each linear model and then aggregates 
the summary statistics via one-shot communication in a novel way. This method gives exact 
FDR control non-asymptotically without any knowledge of the noise variances or making any 
assumption about sparsity of the signal. In certain settings, it has a communication complexity 
that is optimal up to a logarithmic factor. 


1 Introduction 

Modern scientific discoveries are commonly supported by statistical significance summarized from
exploring datasets. In the present world of Big Data, this practice faces a number of difficulties:
an increasing number of hypotheses tested simultaneously, extensive use of sophisticated
techniques, and an enormous number of tuning parameters. In this pipeline, spurious discoveries
arise naturally by random chance alone across nearly all disciplines, including health care
[19, 2, 15], machine learning [9, 7], and neuroscience [22].

To address this challenge, the statistical community in the past two decades has developed a 
variety of approaches. A landmark work [3] proposed the false discovery rate (FDR) as a new measure
of type-I error for claiming discoveries, along with the elegant Benjamini-Hochberg procedure
(BHq) for controlling the FDR in the case of independent test statistics. Roughly speaking, FDR is 
the expected fraction of erroneously made discoveries among all the claimed discoveries. Today, this 
concept has been widely accepted as a criterion for providing evidence about the reproducibility of 
discoveries claimed in one experiment. 

Our motivation for this work is further enhanced by the observation that scientific experiments
are inherently decentralized: a given set of hypotheses is probed by several groups
working in parallel, and each group has very limited access to the datasets collected by the others.
The challenge is then how to perform meta-analysis of results from all groups in a statistically
valid and efficient way, controlling the FDR while achieving higher power (the fraction
of correctly identified true discoveries) than any individual group. As a running example,
imagine an initiative that studies the genetic causes of autism across many research
institutes. Because of the privacy and confidentiality of the datasets held by different institutes, it
would be difficult to share full datasets. In contrast, aggregating small-volume summary statistics
from each institute is a practical solution. A similar issue arises for high-tech companies
that hold background and behavioral information on billions of individuals but are
reluctant to share data for common research topics, in part because of the huge communication costs.

1.1 Problem Setup and Contributions 

To formalize the problem considered throughout the paper, suppose we observe a sequence of linear
models

$$y^i = X^i \beta^i + z^i,$$

where the design $X^i \in \mathbb{R}^{n_i \times p}$ and the response $y^i \in \mathbb{R}^{n_i}$ are collected by the $i$th group, the error
term $z^i$ has i.i.d. $\mathcal{N}(0, \sigma_i^2)$ entries, and the signal $\beta^i \in \mathbb{R}^p$ may vary across groups. Keeping
in mind that a (sufficiently) strong signal in one of the groups is adequate to declare significance
in the meta-analysis, we are interested in any feature $j$ that obeys $\beta^i_j \neq 0$ for at least one $i$.
For any model selection procedure returning a set of discoveries $S \subset \{1, \ldots, p\}$, the false discovery
proportion (FDP) is defined as

$$\mathrm{FDP} = \frac{\#\{1 \le j \le p : j \in S \text{ and } \beta^i_j = 0 \text{ for all } i \ge 1\}}{\max\{\#S, 1\}}, \tag{1}$$

and FDR is the expectation of FDP. We assume that each group has access only to its own data, that
is, $(y^i, X^i)$, and reports summary statistics encoded in about $O(p)$ bits to a coordinating center.
In this protocol, we aim to achieve exact FDR control by only making use of the information
received at the center. Our approach, referred to as knockoff aggregation, is built on top of the
knockoffs framework introduced by Barber and Candès [1]. The knockoff filter remarkably achieves
exact FDR control in the finite sample setting for a single linear model whenever the number of 
variables is no more than the number of observations. The validity of the method does not depend 
on the amplitude or sparsity of the unknown signal, or any knowledge of the noise variance. In 
sharp contrast, the BHq is only known to control the FDR for sequence models under very restricted 
correlation structures [4] apart from the independent case [3]. 
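
As a concrete reading of definition (1), the following minimal Python sketch computes the FDP of a selected set when the true coefficients of every group are known (as in a simulation). The function name `false_discovery_proportion` and the 0-based indexing are illustrative assumptions, not part of the original method.

```python
def false_discovery_proportion(selected, betas):
    """FDP as in (1).

    selected: iterable of 0-based feature indices declared as discoveries.
    betas:    list of per-group coefficient lists (one list of length p per group).
    A discovery j is false when beta^i_j == 0 for every group i.
    """
    selected = set(selected)
    false_discoveries = sum(
        1 for j in selected if all(beta[j] == 0 for beta in betas)
    )
    # max{#S, 1} in the denominator makes the FDP zero when nothing is selected.
    return false_discoveries / max(len(selected), 1)
```

For instance, with two groups whose signals sit on features 0 and 2 respectively, selecting features 0, 1 and 2 yields one false discovery out of three.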

Some appealing features of the knockoff aggregation are listed as follows. Inherited from the 
knockoffs, our method also controls the FDR exactly for general design matrices in a non-asymptotic 
manner and does not require any knowledge of the noise variances of linear models. Apart from 
these inheritances, knockoff aggregation provides more refined information on the significance of 
each hypothesis by aggregating many independent copies of the summary statistics, resembling 
the multiple knockoffs as briefly mentioned in [1]. This property not only improves power by
amplifying the signal, but also allows control of a generalized FDR which incorporates randomized
decision rules. Due to its one-shot nature, this method only costs $O(p \cdot \#\text{linear models})$ bits in
communication, up to a logarithmic factor used in quantizing scalar summary statistics. We also
propose a simple example where this rate of communication complexity is nearly optimal from an 
information-theoretic point of view. 
2 Preliminaries


In this section, we give a concise exposition of the knockoff filter [1]. Consider 

$$y = X\beta + z,$$

where the design $X$ is $n$ by $p$ and the noise term $z$ consists of $n$ i.i.d. $\mathcal{N}(0, \sigma^2)$ entries. The knockoffs
framework assumes the number of variables $p$ is no more than the number of measurements $n$ and
the design matrix $X$ has full rank. This is used to ensure model identifiability since otherwise
there exists a non-trivial linear combination of the $p$ features $X_j$ that sums to zero. Moreover, we
normalize each column: $\|X_j\|_2 = 1$ for all $1 \le j \le p$.

This method starts with constructing the knockoff features $\tilde{X} \in \mathbb{R}^{n \times p}$ that satisfy

$$\tilde{X}^\top \tilde{X} = X^\top X, \qquad X^\top \tilde{X} = X^\top X - \operatorname{diag}(s), \tag{2}$$

where $s \in \mathbb{R}^p$ has nonnegative entries. To understand the constraints, observe that the first equality
forces $\tilde{X}$ to mimic the correlation structure of $X$. The second equality further requires that any
original-knockoff pair $X_j, \tilde{X}_j$ have the same correlation with all the other $2p - 2$ features. In a nutshell,
the purpose of the knockoff design is to manually construct a control group for comparison with the
original design $X$.

The next step is to generate statistics for every original-knockoff pair. Denote by
$X_{\mathrm{KO}} = [X, \tilde{X}] \in \mathbb{R}^{n \times 2p}$ the augmented design matrix. The reference paper suggests choosing the Lasso
on the augmented design as a pilot estimator:

$$\hat\beta(\lambda) = \operatorname*{argmin}_{b \in \mathbb{R}^{2p}} \; \frac12 \|y - X_{\mathrm{KO}} b\|_2^2 + \lambda \|b\|_1.$$

Then, let $Z_j = \sup\{\lambda : \hat\beta_j(\lambda) \neq 0\}$. Similarly, define $\tilde{Z}_j$ for the knockoff variable $\tilde{X}_j$. Then,
a recommended choice of the knockoff statistics is (different notation is used for ease of
exposition)

$$W_j = \max\{Z_j, \tilde{Z}_j\}, \qquad \chi_j = \operatorname{sgn}(Z_j - \tilde{Z}_j),$$

where $\operatorname{sgn}(x) = -1, 0, 1$ depending on whether $x < 0$, $x = 0$, $x > 0$, respectively. As a matter of fact,
many alternative knockoff statistics can be used instead, as emphasized in the reference paper. For
instance, $W_j$ can take the form of any symmetric function of $Z_j$ and $\tilde{Z}_j$. Furthermore, the
pilot estimator is not necessarily confined to the Lasso; alternatives include least-squares, least
angle regression [12], and any likelihood estimation procedure with a symmetric penalty (see e.g.
[13, 25, 5]).
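
To make the last step concrete, here is a minimal Python sketch that turns precomputed entry points $Z_j$ and $\tilde{Z}_j$ into the pair $(W_j, \chi_j)$. It assumes the Lasso path has already been computed elsewhere; the function names `sign` and `knockoff_statistics` are hypothetical.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def knockoff_statistics(z, z_tilde):
    """Compute W_j = max(Z_j, Z~_j) and chi_j = sgn(Z_j - Z~_j) for every feature.

    z, z_tilde: sequences of equal length holding the entry points of the
    original features and of their knockoffs (assumed precomputed).
    """
    if len(z) != len(z_tilde):
        raise ValueError("z and z_tilde must have the same length")
    w = [max(a, b) for a, b in zip(z, z_tilde)]
    chi = [sign(a - b) for a, b in zip(z, z_tilde)]
    return w, chi
```

A feature whose original column enters the path before its knockoff gets $\chi_j = 1$; a tie gives $\chi_j = 0$.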

The following lemma, due to [1], is essential for the proof of FDR control of our knockoff
aggregation. As is clear from (1), we call $j$ a true null when $\beta_j = 0$ and a false null otherwise.

Lemma 2.1. Conditional on all false null $\chi_j$ and all $W_j$, all true null $\chi_j$ are jointly independent
and uniformly distributed on $\{-1, 1\}$.

This simple lemma follows from the delicate symmetry between $X_j$ and its knockoff $\tilde{X}_j$, which
is guaranteed by the construction (2). The result implies that each $\chi_j$ can be interpreted as a
one-bit p-value, in the sense that it takes $1$ or $-1$ with equal probability if $\beta_j = 0$. In the case of
large $|\beta_j|$, we expect $\chi_j$ to be more likely to take $1$ since the original feature $X_j$ is more
likely to enter the Lasso path earlier than $\tilde{X}_j$. This lemma also suggests ordering the hypotheses
based on the magnitude of $W_j$ so that hypotheses that are more likely to be rejected are
tested earlier. Good ordering of hypotheses is a key element for improving power in sequential
hypothesis testing (see e.g. [14, 21]).

3 Aggregating the Knockoffs 

We recap the problem: Observe a sequence of decentralized linear regression models with the same
set of hypotheses of interest,

$$y^i = X^i \beta^i + z^i \tag{3}$$

for $1 \le i \le m$. The design matrix $X^i$ is $n_i$ by $p$ and $z^i$ consists of i.i.d. centered normals. The error
terms $z^i$ are jointly independent and are allowed to have different variance levels. The number of
observed models $m$ is not necessarily deterministic, but must be independent of the randomness of
all $z^i$. We are interested in simultaneously testing

$$H_{0,j} : \beta^1_j = \cdots = \beta^m_j = 0$$

(that is, there is no effect of feature $j$ in any of the studies) versus

$$H_{a,j} : \text{at least one } \beta^i_j \neq 0$$

for $1 \le j \le p$. To fully utilize the knockoffs framework, we assume $n_i \ge p$ for all $i \ge 1$.

Our aggregation starts by running the knockoff filter with an arbitrary pilot estimator for each
linear model (3), which provides us with the ordering statistics $W^i_1, \ldots, W^i_p$ and the one-bit p-values
$\chi^i_1, \ldots, \chi^i_p$. Then, for each $1 \le j \le p$, we aggregate $W^1_j, \ldots, W^m_j$ to produce $W_j$ that measures
the rank of the $j$th hypothesis: the larger $W_j$ is, the earlier the hypothesis $H_{0,j}$ is tested. Let this
summary statistic take the form $W_j = T(W^1_j, \ldots, W^m_j)$ for some nonnegative measurable function
$T$ defined on $\mathbb{R}^m$. Recognizing that a large $W_j$ provides evidence of significant rank of the
corresponding hypothesis, we are particularly interested in summary functions $T$ that are non-decreasing
in each coordinate. No further conditions on $T$ are required. Examples include
$T(x_1, \ldots, x_m) = \max\{x_1, \ldots, x_m\}$, $T(x_1, \ldots, x_m) =$ the sum (or product) of the $r$ largest of $x_1, \ldots, x_m$
for some $1 \le r \le m$, and $T(x_1, \ldots, x_m) = \sum_{i=1}^m n_i x_i$. The first two examples are symmetric in the $W$-statistics
and the last one incorporates sizes of the $m$ models.
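
The three example summary functions can be sketched in Python as below. This is only an illustration under the assumption that each group reports its $W$-statistics as a list of length $p$; the helper names (`t_max`, `t_top_r_sum`, `t_top_r_prod`, `t_weighted`, `aggregate_w`) are hypothetical.

```python
import heapq
from math import prod


def t_max(xs):
    """T(x_1, ..., x_m) = max of the inputs."""
    return max(xs)


def t_top_r_sum(xs, r):
    """Sum of the r largest inputs."""
    return sum(heapq.nlargest(r, xs))


def t_top_r_prod(xs, r):
    """Product of the r largest inputs."""
    return prod(heapq.nlargest(r, xs))


def t_weighted(xs, sizes):
    """Sample-size weighted sum: sum_i n_i * x_i."""
    return sum(n * x for n, x in zip(sizes, xs))


def aggregate_w(w_by_model, summary):
    """Apply a summary function feature by feature.

    w_by_model: list of m lists, each holding the p statistics of one model.
    summary:    callable mapping the m values of one feature to a single number.
    """
    p = len(w_by_model[0])
    return [summary([w[j] for w in w_by_model]) for j in range(p)]
```

All four functions are non-decreasing in each coordinate when the inputs are nonnegative, which is the only property the text asks of $T$.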

With the ordering statistics $W_j$ in place, we move to define the aggregated $\chi$-statistics by making
use of the one-bit $\chi^i_j$:

$$\chi_j = \frac{m}{2} + \frac12 \sum_{i=1}^m \chi^i_j,$$

which is simply the number of $+1$'s among $\chi^1_j, \ldots, \chi^m_j$. The motivation behind this construction is simple:
the more often the original feature $X_j$ wins over its knockoff $\tilde{X}_j$, the stronger the evidence
that $\beta_j$ is nonzero. As will be shown in Lemma 3.2, under the null $\beta^1_j = \cdots = \beta^m_j = 0$, this
aggregated $\chi_j$ follows a simple binomial distribution so that it can be easily translated into a
refined p-value.
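
The aggregated $\chi$-statistic is just a count of wins, which the following short Python sketch illustrates. It assumes every reported sign is $+1$ or $-1$ (ties are not handled here), and the name `aggregate_chi` is hypothetical.

```python
def aggregate_chi(chi_by_model):
    """Return, for each feature j, m/2 + (1/2) * sum_i chi^i_j.

    chi_by_model: list of m lists of signs in {-1, +1}, one list per model.
    With signs in {-1, +1}, this equals the number of +1's for feature j.
    """
    m = len(chi_by_model)
    p = len(chi_by_model[0])
    counts = []
    for j in range(p):
        signs = [chi[j] for chi in chi_by_model]
        if any(s not in (-1, 1) for s in signs):
            raise ValueError("this sketch assumes signs in {-1, +1}")
        # m + sum(signs) is always even here, so integer division is exact.
        counts.append((m + sum(signs)) // 2)
    return counts
```

With three models, a feature that wins against its knockoff in all of them gets the maximal value 3.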

In passing, the content of this section so far is summarized in Algorithm 1.

Algorithm 1 Running the knockoff filter in parallel

Require: $X^1, \ldots, X^m$, $y^1, \ldots, y^m$ and a summary function $T$.

1: Run the knockoff filter for each model and get $\chi^1_j, \ldots, \chi^m_j$ and $W^1_j, \ldots, W^m_j$.

2: (One-shot communication) Let $\chi_j \leftarrow \frac{m}{2} + \frac12 \sum_{i=1}^m \chi^i_j$ and $W_j \leftarrow T(W^1_j, \ldots, W^m_j)$.


3.1 Controlling the Weighted FDR 

Having defined the aggregated knockoff statistics, we now turn to control a generalized FDR. We
call it the weighted false discovery rate ($\omega$FDR), which includes the original FDR as a special
case. In lieu of the accept-or-reject decision rule, we introduce a randomized decision rule
that assigns each hypothesis $H_{0,j}$ a number $\omega_j$ between 0 and 1. The closer $\omega_j$ is to 1, the more
confidence we have in rejecting the hypothesis $H_{0,j}$. The definition of the weighted FDR is given as
follows.


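Definition 3.1. Given a randomized decision rule $\omega \in [0, 1]^p$, define the weighted false discovery
proportion as

$$\omega\mathrm{FDP} = \frac{\sum_{j=1}^p \omega_j \mathbf{1}_{\text{null } j}}{\sum_{j=1}^p \omega_j}$$

if $\sum_{j=1}^p \omega_j > 0$ and otherwise $\omega\mathrm{FDP} = 0$, where $\mathbf{1}_{\text{null } j} = 1$ if the hypothesis is a true null and
otherwise 0. The $\omega$FDR is the expectation of $\omega$FDP.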
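
Definition 3.1 translates directly into a few lines of Python. The sketch below is illustrative only; the name `weighted_fdp` and the boolean encoding of the null indicators are assumptions.

```python
def weighted_fdp(weights, is_null):
    """Weighted false discovery proportion of Definition 3.1.

    weights: rejection weights omega_j in [0, 1], one per hypothesis.
    is_null: booleans, True when hypothesis j is a true null.
    """
    total = sum(weights)
    if total <= 0:
        # Convention of the definition: the proportion is 0 when no weight is assigned.
        return 0.0
    null_weight = sum(w for w, null in zip(weights, is_null) if null)
    return null_weight / total
```

When all weights are 0 or 1, the function returns the usual FDP of the set of hypotheses with weight 1.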


If the weights $\omega_j$ take only the values 0 and 1, then the $\omega$FDR reduces to the vanilla FDR. In general, $\omega_j$
can be interpreted as the probability, or confidence, of randomly rejecting $H_{0,j}$. As a special
case, rejecting a hypothesis with confidence 0 is equivalent to accepting it. The motivation for
this generalized FDR is simple: the accept-or-reject rule behaves like a hard rule that may make
completely different decisions for very close p-values, whereas randomization smooths out this
undesirable artifact. From a practical point of view, this generalized FDR has the potential to find
applications in large-scale Internet experiments where randomized decisions occur frequently.
Randomization of testing also comes naturally from an empirical Bayesian framework (see e.g.
[11]).

As mentioned earlier, the particular form of the refined $\chi$-statistics is motivated by
the fact that the $\chi_j$ under the null hypotheses (i.e. $\beta^1_j = \cdots = \beta^m_j = 0$) are simply i.i.d. binomial
random variables, no matter how complicated the joint distribution of the $W_j$ is. The following lemma
formalizes this point; its proof is just a stone's throw away from Lemma 2.1.


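Lemma 3.2. Conditional on all false null $\chi_j$ and all $W_j$, all true null $\chi_j$ are jointly independent
and have binomial distribution $B(m, 1/2)$.

Therefore, the refined p-value for testing $H_{0,j}$ is naturally given by

$$P_j = \frac{1}{2^m} \sum_{i=\chi_j}^{m} \binom{m}{i},$$

which, by definition, is stochastically no smaller than the uniform distribution on $[0, 1]$ under the
null, that is, $\mathbb{P}(P_j \le t) \le t$ for all $t \in [0, 1]$.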
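
The refined p-value is the upper tail of a $B(m, 1/2)$ distribution evaluated at the aggregated count. A minimal Python sketch, using exact rational arithmetic to avoid rounding, is given below; the function name `refined_p_value` is hypothetical.

```python
from fractions import Fraction
from math import comb


def refined_p_value(chi_j, m):
    """P_j = 2^{-m} * sum_{i = chi_j}^{m} C(m, i), the B(m, 1/2) upper tail at chi_j."""
    if not 0 <= chi_j <= m:
        raise ValueError("chi_j must be an integer between 0 and m")
    tail = sum(comb(m, i) for i in range(chi_j, m + 1))
    return Fraction(tail, 2 ** m)
```

With $m = 5$ models, a feature that wins in all five gets $P_j = 1/32$, the smallest attainable value, which shows how aggregation refines the one-bit information of a single model.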
Now we turn to introduce Algorithm 2, which generalizes the original knockoff+ filter
by incorporating a confidence function $\Omega$. We call $\Omega : [0,1] \to [0,1]$ a confidence function if $\Omega$ is
non-increasing and obeys

$$\lim_{x \to 0+} \Omega(x) = \Omega(0) = 1, \qquad \lim_{x \to 1-} \Omega(x) = \Omega(1) = 0.$$

This function is used to provide the weights $\omega_j$ in rejecting the hypotheses. In the special case of
$\Omega(x) = \mathbf{1}_{x \le c}$ for some $0 < c < 1$, this algorithm reduces to the Selective SeqStep+ in [1]. Interested
readers are referred to [23] for generalizations in a different direction. Below, let $U$ be uniformly
distributed on $[0, 1]$.

We present the main result as follows, which generalizes Theorem 3 of [1]. The control of $\omega$FDR
in the finite sample setting holds for any summary function $T$ and confidence function $\Omega$.

Theorem 3.3. Combining Algorithm 1 and Algorithm 2 gives

$$\omega\mathrm{FDR} \le q.$$


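Algorithm 2 Knockoff filter with aggregated statistics from Algorithm 1

Require: $\chi_1, \ldots, \chi_p$, $W_1, \ldots, W_p$ from Algorithm 1, nominal level $q \in (0,1)$, and confidence function $\Omega$.

1: Compute $P_j = \frac{1}{2^m} \sum_{i=\chi_j}^{m} \binom{m}{i}$.

2: Order hypotheses according to the magnitude of the $W$-statistics: $W_{\rho(1)} \ge W_{\rho(2)} \ge \cdots \ge W_{\rho(p)}$,
where $\rho(\cdot)$ is a permutation of $1, \ldots, p$.

3: Let $\hat{k}$ be

$$\max\left\{ k : \frac{1 + \sum_{j=1}^{k} \bigl(1 - \Omega(P_{\rho(j)})\bigr)}{\sum_{j=1}^{k} \Omega(P_{\rho(j)})} \le \frac{q\,(1 - \mathbb{E}\,\Omega(U))}{\mathbb{E}\,\Omega(U)} \right\},$$

with the convention that $\max \emptyset = -\infty$.

4: Reject all hypotheses $H_{0,\rho(j)}$ for $j \le \hat{k}$ with weight (confidence) $\omega_{\rho(j)} = \Omega(P_{\rho(j)})$.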
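
The four steps of Algorithm 2 can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the caller is assumed to supply $\mathbb{E}\,\Omega(U)$ as `mean_omega` (it equals $c$ for $\Omega(x) = \mathbf{1}_{x \le c}$), and the convention $\max \emptyset = -\infty$ is represented by returning `k_hat = 0`, meaning no rejection.

```python
from math import comb


def refined_p_value(chi_j, m):
    """Upper tail of B(m, 1/2) at chi_j (step 1 of Algorithm 2)."""
    return sum(comb(m, i) for i in range(chi_j, m + 1)) / 2 ** m


def aggregated_knockoff_filter(chi, w, m, q, omega, mean_omega):
    """Sketch of Algorithm 2.

    chi, w:     aggregated statistics from Algorithm 1 (lists of length p).
    m:          number of models; q: nominal level in (0, 1).
    omega:      confidence function on [0, 1]; mean_omega: E[omega(U)], U uniform.
    Returns (k_hat, weights) with weights[j] = omega(P_j) for rejected j, else 0.
    """
    p = len(w)
    pvals = [refined_p_value(c, m) for c in chi]
    # Step 2: process hypotheses in decreasing order of W.
    order = sorted(range(p), key=lambda j: w[j], reverse=True)
    bound = q * (1.0 - mean_omega) / mean_omega
    k_hat, reject_mass, accept_mass = 0, 0.0, 0.0
    # Step 3: largest k whose ratio stays below the bound.
    for k, j in enumerate(order, start=1):
        conf = omega(pvals[j])
        reject_mass += conf
        accept_mass += 1.0 - conf
        if reject_mass > 0 and (1.0 + accept_mass) / reject_mass <= bound:
            k_hat = k
    # Step 4: assign weights to the first k_hat hypotheses.
    weights = [0.0] * p
    for j in order[:k_hat]:
        weights[j] = omega(pvals[j])
    return k_hat, weights
```

With the indicator confidence function $\Omega(x) = \mathbf{1}_{x \le 0.5}$, the ratio in step 3 reduces to one plus the number of p-values above 0.5, divided by the number at or below 0.5, among the first $k$ hypotheses.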


The proof of the theorem relies on two lemmas stated below, which are parallel to Lemma 4 of 
[1]. We defer the proofs of these two lemmas to the Appendix. 

Lemma 3.4. Let

$$V^+(k) = \sum_{j=1}^{k} \Omega(P_j)\,\mathbf{1}_{\text{null } j}, \qquad V^-(k) = \sum_{j=1}^{k} \bigl(1 - \Omega(P_j)\bigr)\,\mathbf{1}_{\text{null } j}.$$

Then,

$$M(k) = \frac{V^+(k)}{1 + V^-(k)}$$

is a super-martingale running backward in $k$ with respect to the filtration $\mathcal{F}_k$, which only knows all
the false null $P_j$, and $V^+(k), V^+(k+1), \ldots, V^+(p)$.

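Lemma 3.5. For any integer $N \ge 1$, let $U, U_1, \ldots, U_N$ be i.i.d. uniform random variables on $[0,1]$.
Then,

$$\mathbb{E}\left[\frac{\sum_{j=1}^{N} \Omega(U_j)}{1 + \sum_{j=1}^{N} \bigl(1 - \Omega(U_j)\bigr)}\right] \le \frac{\mathbb{E}\,\Omega(U)}{1 - \mathbb{E}\,\Omega(U)}. \tag{4}$$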
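
As a sanity check of (4) in the special case $\Omega(x) = \mathbf{1}_{x \le c}$, each $\Omega(U_j)$ is a Bernoulli($c$) variable, so both sides can be computed exactly. The Python sketch below does this with rational arithmetic; the function name `lemma_3_5_sides` is hypothetical and the check covers only this indicator case, not general confidence functions.

```python
from fractions import Fraction
from math import comb


def lemma_3_5_sides(n, c):
    """Exact left- and right-hand sides of (4) for Omega(x) = 1{x <= c}.

    Here sum_j Omega(U_j) is Binomial(n, c), so the left-hand side is
    E[B / (1 + n - B)] with B ~ Binomial(n, c).
    """
    c = Fraction(c)
    lhs = sum(
        comb(n, b) * c ** b * (1 - c) ** (n - b) * Fraction(b, 1 + n - b)
        for b in range(n + 1)
    )
    rhs = c / (1 - c)
    return lhs, rhs
```

In this case the left-hand side works out to $\frac{c}{1-c}(1 - c^N)$, so the inequality holds with a gap that vanishes as $N$ grows.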
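Now, we turn to give the proof of Theorem 3.3. The idea of the proof is similar to that of
Theorem 2 in [1].

Proof of Theorem 3.3. Recall that the weights are given as $\omega_j = \Omega(P_j)$ if $H_{0,j}$ is ranked among
the first $\hat{k}$ hypotheses processed by Algorithm 2 and otherwise $\omega_j = 0$. Due to the independence
between $m$ and all $z^i$, by a conditional argument the theorem is reduced to proving the claim for a
deterministic $m$. By Lemma 3.2, without loss of generality, we may assume $W_1 \ge \cdots \ge W_p$. Then,
on the event $\hat{k} \ge 1$ (otherwise $\omega\mathrm{FDP} = 0$), we get

$$
\begin{aligned}
\omega\mathrm{FDP}
&= \frac{\sum_{j=1}^{\hat{k}} \Omega(P_j)\,\mathbf{1}_{\text{null } j}}{\sum_{j=1}^{\hat{k}} \Omega(P_j)} \\
&= \frac{1 + \sum_{j=1}^{\hat{k}} (1 - \Omega(P_j))\,\mathbf{1}_{\text{null } j}}{\sum_{j=1}^{\hat{k}} \Omega(P_j)}
   \cdot \frac{\sum_{j=1}^{\hat{k}} \Omega(P_j)\,\mathbf{1}_{\text{null } j}}{1 + \sum_{j=1}^{\hat{k}} (1 - \Omega(P_j))\,\mathbf{1}_{\text{null } j}} \\
&\le \frac{1 + \sum_{j=1}^{\hat{k}} (1 - \Omega(P_j))}{\sum_{j=1}^{\hat{k}} \Omega(P_j)}
   \cdot \frac{\sum_{j=1}^{\hat{k}} \Omega(P_j)\,\mathbf{1}_{\text{null } j}}{1 + \sum_{j=1}^{\hat{k}} (1 - \Omega(P_j))\,\mathbf{1}_{\text{null } j}} \\
&\le \frac{q\,(1 - \mathbb{E}\,\Omega(U))}{\mathbb{E}\,\Omega(U)} \cdot M(\hat{k}).
\end{aligned}
$$

Recognizing that $\hat{k}$ is a stopping time with respect to the filtration $\mathcal{F}_k$, we apply Doob's optional stopping
theorem to the super-martingale $M(k)$ in Lemma 3.4:

$$
\begin{aligned}
\omega\mathrm{FDR}
&\le \frac{q\,(1 - \mathbb{E}\,\Omega(U))}{\mathbb{E}\,\Omega(U)} \cdot \mathbb{E}\,M(\hat{k}) \\
&\le \frac{q\,(1 - \mathbb{E}\,\Omega(U))}{\mathbb{E}\,\Omega(U)} \cdot \mathbb{E}\,M(p) \\
&\le \frac{q\,(1 - \mathbb{E}\,\Omega(U))}{\mathbb{E}\,\Omega(U)} \cdot \frac{\mathbb{E}\,\Omega(U)}{1 - \mathbb{E}\,\Omega(U)} = q,
\end{aligned}
$$

where the last inequality follows from Lemma 3.5 and the observation that, for a true null, $\Omega(P_j)$ is stochastically
dominated by $\Omega(U_j)$.

□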


3.2 Controlling Other Error Rates 

While it is beyond the scope of the present paper, we would like to briefly point out that knockoff
aggregation can be applied to control other type-I error rates, including the $k$-familywise error
rate ($k$-FWER) [18], $\gamma$-FDP [16], and per-family error rate (PFER) [17]. These error rates have
different interpretations from the FDR and are more favorable in certain applications. Interested
readers are referred to a recent work [20] where some attractive features of the knockoffs framework
are translated into a novel procedure for provably controlling the $k$-FWER and PFER. With
more refined information, knockoff aggregation has the potential to improve power while still
controlling these error rates.
4 Communication Complexity


The knockoff aggregation is communication efficient due to its one-shot nature. For each decentralized
linear model, the message sent to the coordinating center is merely the sign information $\chi^i_j$ and
the ordering information $W^i_j$. This piece of information can be encoded in $O(p\,\mathrm{poly}(\log p))$ bits,
where the polylogarithmic factor is used in quantizing each $W^i_j$ depending on the accuracy required.

Hence, the total communication is $O(mp)$ bits up to a polylogarithmic factor. We can further get rid
of this logarithmic factor by forcing $W^i_j$ to take only the values 0 or 1, depending on whether the
original $W^i_j$ is below the median of the original $W^i_1, \ldots, W^i_p$ or not.
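
The binarized message described above can be sketched as follows. This is an illustration under stated assumptions: signs are taken to be $\pm 1$, a statistic exactly at the median is encoded as 1, and the function names `encode_message` and `total_bits` are hypothetical.

```python
from statistics import median


def encode_message(chi, w):
    """Encode one model's statistics as 2p bits.

    chi: signs chi^i_1, ..., chi^i_p (assumed in {-1, +1}).
    w:   ordering statistics W^i_1, ..., W^i_p.
    The first p bits carry the signs, the last p bits tell whether each
    W^i_j is below the median (0) or not (1).
    """
    med = median(w)
    sign_bits = [1 if c > 0 else 0 for c in chi]
    order_bits = [0 if x < med else 1 for x in w]
    return sign_bits + order_bits


def total_bits(m, p):
    """Total communication of m models, each sending 2p bits."""
    return 2 * m * p
```

For example, $m = 5$ models with $p = 1000$ features each would send 10000 bits in total under this encoding.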

It would be interesting to get a lower bound on the total communication cost required to
control the FDR while maintaining decent power. A trivial such bound is $\Omega(p)$ since it takes
$p$ bits to fully characterize the support set of $\beta^i$ (in this section $\Omega(\cdot)$ is the Big Omega notation in
complexity theory, instead of the confidence function). In general, this bound is unachievable since
the summary statistics from each decentralized model are obtained by using only local information.
To shed light on this, we provide a simple but illuminating example of (3) where $\Theta(mp)$ is the
optimal communication cost, up to a polylogarithmic factor, for achieving asymptotically vanishing FDP
and full power. To start with, fix the noise level $\sigma_i^2 = 1$. All $\beta^i$ are equal to a common
$\beta$. Let the design matrix $X^i$ of each decentralized model be a $(2p) \times p$ matrix with orthonormal
columns, and let each $\beta_j$ independently take a common value $\mu$ with probability one half and 0 otherwise. We
further assume that $\beta$ is independent of $z^i$, and both $p, m \to \infty$ but do not differ extremely from
each other in the sense that $p = O(e^m)$ and $m = O(\mathrm{poly}(p))$. This condition allows $m \asymp \log p$
or $m \asymp p^\alpha$ for arbitrary $\alpha > 0$. Here, the summary statistics are defined as follows: We regress $y^i$ on
the augmented design $[X^i, \tilde{X}^i]$ ($\tilde{X}^i$ is an orthogonal complement of $X^i$), obtain the least-squares
estimates $\hat\beta^i_j, \tilde\beta^i_j$ for each $1 \le j \le p$, and then take $\chi^i_j = \operatorname{sgn}(\hat\beta^i_j - \tilde\beta^i_j)$ and $W^i_j = 1$ or 0, respectively,
depending on whether $|\hat\beta^i_j - \tilde\beta^i_j|$ is above the median of $|\hat\beta^i_1 - \tilde\beta^i_1|, \ldots, |\hat\beta^i_p - \tilde\beta^i_p|$ or not.

Under the preceding assumptions, the knockoff aggregation almost perfectly recovers the support
set of the signal $\beta$, with a total communication cost of $O(mp)$. Let $V \in \{0,1\}^p$ be constructed
as $V_j = 1$ if $\beta_j \neq 0$ and otherwise $V_j = 0$ (so $V$ is uniformly distributed on the cube $\{0,1\}^p$). Similarly,
the output of the knockoff aggregation, denoted as $V_{\mathrm{KO}}$, takes the form $V_{\mathrm{KO},j} = 1$ if $H_{0,j}$
is rejected and $V_{\mathrm{KO},j} = 0$ if $H_{0,j}$ is accepted. Last, denote by $\mathrm{Hamm}(\cdot, \cdot)$ the Hamming distance.

Proposition 4.1. Let $\epsilon$ be any positive constant. With probability tending to one, this knockoff
aggregation with slowly vanishing nominal levels $q$ obeys

$$\mathrm{Hamm}(V_{\mathrm{KO}}, V) \le \epsilon p.$$

Hence, the knockoff aggregation is capable of distinguishing almost all the signal features from
the noise features, resulting in $\mathrm{FDP} \to 0$ and power $\to 1$. The nominal level $q$ is spelled out in the
proof.

Next, we move to give the information-theoretic lower bound. Denote by $M^i$ the message sent
by the $i$th model, which only depends on the local information $y^i, X^i$. Then the coordinating
center makes decisions $\hat{V} \in \{0, 1\}^p$ to reject or accept each of the $p$ hypotheses solely based on the
$m$ messages $M^1, \ldots, M^m$. In other words, the protocol is non-interactive. Let $L^i$ be the
minimal length of $M^i$ in bits, with a preassigned budget constraint $\mathbb{E}(L^1 + \cdots + L^m) \le B$. The
proof of the result below uses tools from [8].



Proposition 4.2. Let $\epsilon$ and $C$ be arbitrary positive constants. If the total communication budget

$$B \le \frac{Cmp}{\log^{2.1} p},$$

then for any non-interactive protocol,

$$\mathrm{Hamm}(\hat{V}, V) \ge \frac{1 - \epsilon}{2}\, p$$

holds with probability tending to one.

Incidentally, the exponent 2.1 can be replaced by any constant greater than 2. To appreciate
this result, note that randomly flipping a coin for each hypothesis would give a Hamming distance
of about $p/2$ from the true support set $V$. Hence, it is hopeless to draw any statistically valid
conclusion based on $O(mp/\log^{2+o(1)} p)$ bits of information in the distributed setting. In a nutshell,
for our example a communication budget of $O(mp)$, up to a logarithmic factor, is both sufficient
and necessary for recovering the true signal.

5 Numerical Experiments 

In this section, we test the performance of the knockoff aggregation under a range of designs with
different sparsity levels and signal strengths. Recall that we have $m$ decentralized linear models

$$y^i = X^i \beta^i + z^i$$

for $i = 1, \ldots, m$, where $X^i \in \mathbb{R}^{n_i \times p}$. Our setup is similar to [1]. First, the rows of $X^i$ are drawn
independently from $\mathcal{N}(0, \Sigma)$, and are also independent of other sub-models. The columns of each
$X^i$ are then normalized to have unit length. Second, given a sparsity level $k$, we randomly sample
$k$ signal locations and set $\beta^i_j = A$ for each selected index $j$ and all $i$, where $A$ is a fixed magnitude.

Last, we fix the design and repeat the experiment by drawing $y \sim \mathcal{N}(X\beta, I)$. The nominal FDR level
is set to $q = 0.20$.

5.1 Power Gains with FDR Control 

Our first experiment tests the performance across different $m$ as the sparsity level or signal strength
varies. To save space, here we only show the results for the models with independent features,
i.e. $\Sigma$ is diagonal. Similar patterns still hold under correlated designs. We take $p = 1000$, and
$n_i = 3000$ for each $i$. Fix $\Sigma = I$ and $m = 5$. In this scenario, we take $A = 1.2\sqrt{(2\log p)/5} \approx 1.99$,
where $\sqrt{(2\log p)/5}$ is the universal threshold for detection if we had access to the entire datasets
of the $m = 5$ decentralized models, and 1.2 is a compensation factor for information loss in our
communication-efficient aggregation. Each experiment is repeated 30 times.
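
The amplitude used here is easy to reproduce numerically. The short Python sketch below evaluates $A = 1.2\sqrt{(2\log p)/m}$ with the natural logarithm, reading the constant 5 in the formula as the number of models $m$; the function name `signal_amplitude` is hypothetical.

```python
from math import log, sqrt


def signal_amplitude(p, m, factor=1.2):
    """Return factor * sqrt(2 * log(p) / m), with log the natural logarithm."""
    return factor * sqrt(2 * log(p) / m)
```

With $p = 1000$ and $m = 5$, this evaluates to roughly 1.99, in agreement with the value quoted in the text.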

In the knockoff aggregation, we are allowed to choose the summary function $T$ (in Algorithm 1)
and confidence function $\Omega$ (in Algorithm 2). The choice can be made adaptively to different $m, n, p$
and the design structure. In the following simulation, we take $T(W^1_j, \ldots, W^m_j) = \sum_{i=1}^m n_i W^i_j$ and
$\Omega(P_j) = \mathbf{1}_{P_j \le 0.5}$.

Figure 1 shows the FDR and power achieved by knockoff aggregation with fixed signal strength
$A = 1.99$ and varying sparsity levels $k = 10, 30, 50, 100$, as well as with fixed sparsity $k = 30$
and varying strengths $A = c\sqrt{(2\log p)/5}$, where $c = 1.0, 1.2, 1.5, 2.0$. Recall that the power is the
fraction of identified true discoveries among all the $k$ potential true discoveries. We see that our
procedure can effectively control the FDR for different $m$ in both cases. Meanwhile, as $m$ increases,
power gains are significant even though we only need $O(mp)$ bits of communication.



Figure 1: Top row: mean FDP and power versus sparsity level $k$ with fixed strength $A = 1.99$.
Bottom row: mean FDP and power versus amplitude level $A$ with fixed sparsity $k = 30$. We
use i.i.d. design with $p = 1000$, $n_i = 3000$ and nominal level $q = 0.20$.


5.2 Comparison with Other Methods 

We compare the knockoff aggregation with other methods, such as least-squares (OLS) and the
Lasso. For the OLS, we consider the following procedure. For each $i$, we have the OLS estimator $\hat\beta^i$
based on $(X^i, y^i)$. This estimator obeys $\hat\beta^i \sim \mathcal{N}(\beta, \Theta^i)$, where $\Theta^i = ((X^i)^\top X^i)^{-1}$. Then, $\hat\beta^i$ and
the corresponding marginal variances $(\Theta^i_{jj})_{1 \le j \le p}$ are aggregated from the $m$ nodes as follows. Let
$\bar\beta_j = \sum_{i=1}^m \hat\beta^i_j / m$ be an averaged estimator of $\beta_j$ and $\bar\Theta_j = (1/m^2) \sum_{i=1}^m \Theta^i_{jj}$ its variance. Set the
z-score $Z_j = \bar\beta_j / \sqrt{\bar\Theta_j}$ for testing $H_{0,j}$. Note that $Z_j \sim \mathcal{N}(0, 1)$ marginally when $\beta^1_j = \cdots = \beta^m_j = 0$.
Ignoring the correlations, we apply the BHq procedure directly to the p-values derived from the
z-scores. We simply call this OLS for convenience hereafter.
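
The OLS baseline can be sketched in Python as below, assuming the per-node estimates and marginal variances are already available. The use of two-sided p-values is an assumption of this sketch (the text does not specify the sidedness), and the function names are hypothetical; the BHq step is the standard Benjamini-Hochberg step-up rule.

```python
from math import erfc, sqrt


def aggregated_z_scores(beta_hats, variances):
    """Average the m estimates and form z-scores Z_j = mean_j / sqrt(var_j).

    beta_hats: list of m lists with the OLS estimates of each node.
    variances: list of m lists with the marginal variances Theta^i_jj.
    """
    m = len(beta_hats)
    p = len(beta_hats[0])
    z = []
    for j in range(p):
        mean_j = sum(b[j] for b in beta_hats) / m
        var_j = sum(v[j] for v in variances) / m ** 2
        z.append(mean_j / sqrt(var_j))
    return z


def two_sided_p(z):
    """Two-sided normal p-value P(|N(0,1)| >= |z|) (an assumption of this sketch)."""
    return erfc(abs(z) / sqrt(2))


def benjamini_hochberg(pvals, q):
    """Return the sorted indices rejected by the BHq step-up procedure at level q."""
    p = len(pvals)
    order = sorted(range(p), key=lambda j: pvals[j])
    k = 0
    for rank, j in enumerate(order, start=1):
        if pvals[j] <= q * rank / p:
            k = rank
    return sorted(order[:k])
```

A typical use would be `benjamini_hochberg([two_sided_p(z) for z in aggregated_z_scores(beta_hats, variances)], q)`.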

For the Lasso, we take the following approach. Given data $(X^i, y^i)$, we compute the Lasso
estimates in parallel

$$\hat\beta^i_{\mathrm{Lasso}} = \operatorname*{argmin}_{b \in \mathbb{R}^p} \; \frac12 \|y^i - X^i b\|_2^2 + \lambda_i \|b\|_1,$$

where $\lambda_i > 0$ is a regularization parameter and is often selected by cross-validation. In particular,
we choose the largest value of $\lambda_i$ such that the cross-validation error is within one standard error
of the minimum. For each $i$, the support set of $\hat\beta^i_{\mathrm{Lasso}}$ is sent to the center and a majority vote is
applied to determine whether to accept $H_{0,j}$ or not.
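
The majority vote at the center can be sketched as follows, assuming each node sends the set of selected feature indices. Requiring a strict majority (more than half of the nodes) is an assumption of this sketch, as is the function name `majority_vote`.

```python
def majority_vote(supports, p):
    """Select feature j when it appears in more than half of the reported supports.

    supports: list of m collections of 0-based selected indices, one per node.
    p:        total number of features.
    """
    m = len(supports)
    votes = [0] * p
    for support in supports:
        for j in set(support):
            votes[j] += 1
    # Strict majority: 2 * votes > m (tie-breaking rule assumed, not from the text).
    return [j for j in range(p) if 2 * votes[j] > m]
```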

To compare these methods, we consider a correlated design as an illustration. Let $p = 500$,
$n_i = 1500$, and $m = 5$. To generate the rows of $X^i$, we set $\Sigma_{ij} = 1$ if $i = j$ and $\Sigma_{ij} = -0.3/(0.3 \cdot (p - 2) + 1)$
otherwise. Fix the sparsity level $k = 100$ and signal strength $A = 5\sqrt{2\log p}$. In this setting, while
the powers of these procedures are essentially 1, their behavior in FDP shown in Table 1 is very
distinct.



                        Mean of FDP    SD of FDP
Knockoff Aggregation    0.1774         0.0493
OLS                     0.1433         0.1260
Lasso                   0.3458         0.0405

Table 1: FDP of knockoff aggregation, OLS and Lasso.


The Lasso with a cross-validated penalty lacks a guarantee of FDR control (see e.g. [24]).
In the case of correlated designs, its empirical FDR of 0.3458 is far above the nominal level
$q = 0.20$. Despite the fact that we choose a sparser model in cross-validation, the Lasso still tends to
select more variables than necessary. In terms of controlling false discoveries, the Lasso does not
give a satisfactory solution. In contrast, the mean FDP of the knockoff aggregation and that of
the OLS are both under the nominal level, though the former is slightly higher than the latter.
More importantly, as shown in Figure 2, the FDPs of the knockoff aggregation are tightly
concentrated around the nominal level, while those of the OLS are widely spread: sometimes the
proportion of false discoveries can be as high as 70% for the OLS. For reference, the estimated
standard deviation of the knockoff FDP is 0.0493, while, in stark contrast, that of the OLS FDP is
0.1260, about 2.6 times as large. Such high variability is undesirable in practice. Researchers would
not like to take the risk of having 70% false discoveries in any study.





Figure 2: Histogram of the FDPs by knockoff aggregation, OLS, and Lasso with 200 replicates.
$m = 5$, $p = 500$, and $n_i = 1500$. Sparsity level $k = 100$ and signal strength $A = 5\sqrt{2\log p}$.
6 Discussion


We introduce a communication-efficient method for aggregating the knockoff filter running on many 
decentralized linear models. This knockoff aggregation enjoys exact FDR control and some desired 
properties inherited from the knockoffs framework. Simulation results provide evidence that this 
proposed method exhibits nice properties in a range of examples. 

Many challenging problems remain, and we mention a few of them. An outstanding open problem
is to generalize the knockoffs framework to the high-dimensional setting $p > n$. This would also
help the knockoff aggregation cover a broader range of applications. In addition, the flexibility in
the choice of the functions $\Omega$ and $T$ leaves room for further investigation: for example, does there
exist an optimal $\Omega$ or $T$? Can these functions be chosen in a data-driven fashion? Last, it would
be interesting to incorporate differential privacy (see e.g. [10]) in the stage of aggregation, which 
could lead to much stronger protection of confidentiality. 
A Technical Proofs

Proof of Lemma 3.2. Denote by $S$ the set of indices of all false null hypotheses, that is,
$S = \{1 \le j \le p : \text{at least one } \beta^i_j \neq 0\}$. Similarly, $S^c$ corresponds to the true null hypotheses. For
each $i$, conditional on $W^i := (W^i_1, \ldots, W^i_p)$ and $\chi^i_S$, from Lemma 2.1 it follows that $\chi^i_{S^c}$ has
i.i.d. components uniformly distributed on $\{-1, 1\}$. Since the $m$ linear models are independently
generated, we thus see that the concatenation of $\chi^i_{S^c}, 1 \le i \le m$, is uniformly distributed on
$\{-1, 1\}^{m|S^c|}$ conditional on all $W^i$ and all $\chi^i_S$. Then the proof immediately follows by recognizing
that $W_j$ and $\chi_j$ only depend on $W^i_j, 1 \le i \le m$ and $\chi^i_j, 1 \le i \le m$, respectively. The binomial
distribution follows from the fact that $\chi_j$ is simply the number of $+1$'s among $\chi^i_j, 1 \le i \le m$.

□

Proof of Lemma 3.f. We use some ideas from the proof of Lemma 4 in [1]. Given the filtration J- k , 
we know all V~{k ),..., V~(p) since it always holds that V~{k') + V + {k') = ff{j null : j < k'} for 
all k' > k. Without loss of generality, we assume all the hypotheses are true due to the observation 
that M(j) agrees with M(j — 1) if the jth hypothesis is true. Write V + {k) = D(Pj) = A. By 
the exchangeability of Lt(Pi), ..., Q(Pk), we get 

E(V + (k - 1 )| F k ) = E(V + (k - 1 )| V + {k) =A) = ^A. ( 5 ) 

Hence, we have 


E(M(k - l)\F k ) = E{M{k - 1)| V + (k) = A) 


= E 


E k jZl^(Pj)/k 

1-E Snw 


V + {k) 



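To proceed, note that $x/(1-x)$ is convex for $x < 1$. Since $\sum_{j=1}^{k-1} \Omega(P_j) \in [A-1, A]$ almost surely, by the inverse Jensen inequality we get

$$\mathbb{E}\left[\frac{\sum_{j=1}^{k-1} \Omega(P_j)/k}{1 - \sum_{j=1}^{k-1} \Omega(P_j)/k} \,\middle|\, V^+(k) = A\right] \le \frac{(A-1)/k}{1 - (A-1)/k}\,\eta + \frac{A/k}{1 - A/k}\,(1 - \eta),$$

where $\eta = A/k$ is provided by (5). Simple calculation reveals that

$$\frac{(A-1)/k}{1 - (A-1)/k}\,\eta + \frac{A/k}{1 - A/k}\,(1 - \eta) = \frac{A}{1 + k - A} = M(k),$$

as desired.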


□ 
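The "simple calculation" at the end of this proof can be verified in exact rational arithmetic. The sketch below is only a check of the algebraic identity with $\eta = A/k$; the helper names `two_point_bound` and `m_stat` are hypothetical, and the grid of values of $A$ and $k$ is an arbitrary choice.

```python
from fractions import Fraction

def two_point_bound(A, k):
    """Expectation of x/(1-x) under the two-point law on {(A-1)/k, A/k}
    that puts mass eta = A/k on the lower point."""
    A = Fraction(A)
    eta = A / k
    lo = (A - 1) / k
    hi = A / k
    return lo / (1 - lo) * eta + hi / (1 - hi) * (1 - eta)

def m_stat(A, k):
    """M(k) = A / (1 + k - A) when all hypotheses are null and V^+(k) = A."""
    A = Fraction(A)
    return A / (1 + k - A)

def identity_holds(max_k=30):
    """Check the identity for integer and quarter-integer A with 1 <= A < k."""
    for k in range(2, max_k + 1):
        for quarter_steps in range(4, 4 * k):
            A = Fraction(quarter_steps, 4)
            if two_point_bound(A, k) != m_stat(A, k):
                return False
    return True

print(identity_holds())
```

Non-integer values of $A$ are included because $\Omega$ takes values in $[0, 1]$, so $V^+(k)$ need not be an integer.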


Proof of Lemma 3.5. Write $a = \mathbb{E}\,\Omega(U) \in (0, 1)$. Note that each summand $\Omega(U_j)$ obeys $0 \le \Omega(U_j) \le 1$ and $\mathbb{E}\,\Omega(U_j) = a$. We assert that the right-hand side (RHS) of (4) attains the maximum if each $\Omega(U_j)$ is replaced by an i.i.d. Bernoulli random variable $B(a)$. (Note that $B(a)$ assumes only the values $0, 1$ and obeys $\mathbb{E}\,B(a) = a$.) To see this, we examine which $\Omega(U_1)$ gives the maximum conditional expectation while $\sum_{j=2}^{N} \Omega(U_j)$ is fixed. Write

$$\frac{\sum_{j=1}^{N} \Omega(U_j)}{1 + \sum_{j=1}^{N} (1 - \Omega(U_j))} = \frac{\Omega(U_1) + \sum_{j=2}^{N} \Omega(U_j)}{N + 1 - \sum_{j=2}^{N} \Omega(U_j) - \Omega(U_1)}. \tag{6}$$

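For any constants $c_1 \ge 0$, $c_2 > 1$, the function $(x + c_1)/(c_2 - x)$ is convex on $[0, 1]$. Hence, the inverse Jensen's inequality gives

$$\mathbb{E}\left(\frac{\Omega(U_1) + c_1}{c_2 - \Omega(U_1)}\right) \le \mathbb{E}\left(\frac{B(a) + c_1}{c_2 - B(a)}\right).$$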


Applying the last display to (6), we see that the RHS of (4) shall never decrease if each $\Omega(U_j)$ is replaced by i.i.d. $B(a)$. As a consequence, it only remains to show

$$\mathbb{E}\left(\frac{B(N, a)}{1 + N - B(N, a)}\right) \le \frac{a}{1 - a},$$

which has been established in the proof of Lemma 4 in [1]. This finishes the proof.


□ 
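The final binomial bound can be checked exactly for small $N$. The sketch below evaluates $\mathbb{E}[B/(1+N-B)]$ for $B \sim \mathrm{Binomial}(N, a)$ with rational arithmetic and compares it with $a/(1-a)$. The closed form $\frac{a}{1-a}(1 - a^N)$ used for comparison comes from a short side calculation (via $\binom{N}{b}\frac{b}{N+1-b} = \binom{N}{b-1}$) and is not quoted from [1]; the function names are hypothetical.

```python
from fractions import Fraction
from math import comb

def expected_ratio(N, a):
    """Exact E[B / (1 + N - B)] for B ~ Binomial(N, a), with a rational."""
    a = Fraction(a)
    return sum(
        comb(N, b) * a**b * (1 - a) ** (N - b) * Fraction(b, 1 + N - b)
        for b in range(N + 1)
    )

def closed_form(N, a):
    """Side calculation: the expectation equals a/(1-a) * (1 - a^N)."""
    a = Fraction(a)
    return a / (1 - a) * (1 - a**N)

def bound_holds(max_N=12):
    """Check the bound E[B/(1+N-B)] <= a/(1-a) on a small grid of (N, a)."""
    grid = [Fraction(1, 10), Fraction(1, 3), Fraction(1, 2), Fraction(9, 10)]
    for N in range(1, max_N + 1):
        for a in grid:
            if expected_ratio(N, a) > a / (1 - a):
                return False
    return True

print(bound_holds())
print(expected_ratio(3, Fraction(1, 2)), closed_form(3, Fraction(1, 2)))
```

Since $1 - a^N < 1$, the closed form shows that the bound is strict for every finite $N$ and becomes tight as $N \to \infty$.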


Proof of Lemma 4.1. We start with the observation that

$$\mathbb{P}\left(\chi_j^i = 1 \mid \beta_j = \mu\right) = \frac{1}{2} + (1 + o(1))\sqrt{\log p / m}.$$


Hence, $X_j = m/2 + (1 + o_P(1))\sqrt{m \log p}$ if $\beta_j = \mu$ for a fixed $j$. By the central limit theorem, $X_j = m/2 + O_P(\sqrt{m})$ if $\beta_j = 0$. Set the confidence function $\Omega(x) = \mathbf{1}_{x < c_p}$ for some slowly vanishing sequence $c_p$. Then, as $\log p \to \infty$ when taking the limit $p \to \infty$, we have


#{j : ft = (iMPj) = 1} ; 1 

p 2 

#{j:f3 j = Q,n(P J )=Q} 1 

P 2 


with probability tending to one, which implies that 

i + E^i(i-^o))) 

E U^PpU)) 


for any permutation $\rho(\cdot)$. Then the rejection rule $k$ given by Algorithm 2 takes the value $p$ with probability approaching one, since the targeted upper bound $q/\mathbb{E}\,\Omega(U) - q = q/c_p - q > 1$ asymptotically (set $c_p \ll q = q_p$). Recognize that for almost all $j$ the weight $\Omega(P_j) = 1$ if and only if $\beta_j \neq 0$. This is equivalent to saying that $\hat{V}$ has only a vanishing fraction of indices that do not agree with $V$.

□ 


Proof of Proposition 4.2. For the sake of generality, replace 2.1 by $2 + \varepsilon_1$ and $\varepsilon$ by $\varepsilon_2$. The proof makes extensive use of Lemmas 2, 4, and 6 of [8]. Take $\delta = 1/2$, $\sigma = 1/p$ in Lemma 6 of [8], and


a = 
First, we show that


I(V;M) = o(p). (7) 

Recognizing $n_i = 1$ (this is the notation used in [8], not in our problem setup), we see that the pair $(a, \delta)$ satisfies Eqn. (18) of [8] for sufficiently large $m$. Then, combined with Lemma 4, Eqn. (19b)
gives


i =1 

r 2 2 m 

< 128 —+ prnh-i (p*) + pmp* 

(7 4 i 

m 

) + pmh2(p* ) + pmp* 
+ pm/i2(<?*) + pmg*, 


i=l 


< 


< 


32log 2+ 2 p 
m 


32 B log 2+ 2 p 


m 


h 


h 


where the last inequality follows from Shannon's source coding theorem [6]. This inequality asserts that (7) is a simple consequence of

$$I_l = o(p) \tag{8}$$

for all $l = 1, 2, 3$. Now we move to prove (8). For $l = 1$, we have


j- _ 32 B log 2+ 2 p _ o 


mp log 2+ 2 p 


m 


log 2+ei p m 


= O 


p 


log 2 p 


= o(p). 


For l = 2,3, note that 


q = min < 2 e 


(q-0.5) 2 1 


log + 2 p. 


Since a 1, it follows that the exponent 

(a - 0.5 ) 2 

2 ^ 

As a consequence, 

q * = exp 0 (log 1 + l" p)^ = o(l). 
Thus, $I_3 = O(I_2)$, and $I_2$ further obeys

h x — pmq * log g* 

pm log 1 ' 1 '! 1 p 


exp ^ 0 (log 1+ 2 p^j 


= o(p ), 


which makes use of $m = O(\mathrm{poly}(p))$. This proves (7).

Next, we proceed to finish the proof by resorting to Lemma 2 of [8]. To this end, set $t = (0.5 - \varepsilon_2)p$ (if $\varepsilon_2 \ge 0.5$ then there is nothing to prove). Then, Lemma 2 yields

$$\mathbb{P}\left(\mathrm{Hamm}(\hat{V}, V) > (0.5 - \varepsilon_2)p\right) \ge 1 - \frac{I(V; M) + \log 2}{\log \frac{2^p}{N_t}}, \tag{9}$$
where, in our setting, $N_t = \#\{v \in \{0, \mu\}^p : \|v\|_0 \le (0.5 - \varepsilon_2)p\}$. By the large deviation theory, we get

$$\log \frac{2^p}{N_t} \sim \left(\log 2 + (0.5 - \varepsilon_2)\log(0.5 - \varepsilon_2) + (0.5 + \varepsilon_2)\log(0.5 + \varepsilon_2)\right)p \asymp p. \tag{10}$$

Substituting (7) and (10) into (9), we obtain

$$\mathbb{P}\left(\mathrm{Hamm}(\hat{V}, V) > (0.5 - \varepsilon_2)p\right) \ge 1 - o(1),$$

which completes the proof. 

□ 
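The large deviation estimate (10) can be illustrated numerically. The sketch below assumes that $N_t$ is the number of vectors in a two-letter alphabet of length $p$ with at most $(0.5 - \varepsilon_2)p$ nonzero entries, so that $N_t = \sum_{k \le t} \binom{p}{k}$; this reading, the helper names, and the test values $p \in \{500, 2000\}$ and $\varepsilon_2 = 0.1$ are assumptions made for illustration only.

```python
from math import comb, floor, log

def log_ratio_per_coordinate(p, eps):
    """Compute log(2^p / N_t) / p with N_t = sum_{k <= t} C(p, k), t = floor((0.5 - eps) p)."""
    t = floor((0.5 - eps) * p)
    n_t = sum(comb(p, k) for k in range(t + 1))
    # math.log accepts arbitrarily large Python integers.
    return (p * log(2) - log(n_t)) / p

def limiting_rate(eps):
    """Constant multiplying p on the right-hand side of (10)."""
    lo, hi = 0.5 - eps, 0.5 + eps
    return log(2) + lo * log(lo) + hi * log(hi)

eps = 0.1
rate = limiting_rate(eps)
gap_small = log_ratio_per_coordinate(500, eps) - rate
gap_large = log_ratio_per_coordinate(2000, eps) - rate

print(rate)
print(gap_small, gap_large)
```

The gap between the finite-$p$ value and the limiting constant shrinks as $p$ grows, which is consistent with $\log(2^p/N_t)$ being of exact order $p$ whenever $0 < \varepsilon_2 < 0.5$.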